1
The Performance Gap: Why Extend NumPy?
AI018 Lesson 5
00:00

While NumPy is built on C, certain compute-intensive algorithms hit a vectorization wall. This occurs when the inherent latency of Python's dynamic nature outweighs the benefits of high-level abstraction.

1. The Interpreter Tax & Boxing

Every iteration in a standard Python loop involves dynamic type-checking and reference counting. Even when using NumPy scalars, the "boxing" of raw C-data into Python objects creates a massive bottleneck for functions like $\text{logit}(p) = \log(p/(1-p))$. Handling edge cases in C is drastically faster:

>>> logit(0) -> -inf
>>> logit(1) -> inf
>>> logit(2) -> nan
>>> logit(-2) -> nan

2. Intermediate Array Bloat

Pure NumPy expressions create temporary memory buffers for each sub-operation. Extending via the C-API allows for Kernel Fusion, where the logit transform is calculated in a single pass without auxiliary memory overhead.

3. Spatial Dependencies

Operations involving neighbor-access patterns, such as the 2D stencil:

$$B(I, J) = A(I, J) + (A(I-1, J) + A(I+1, J) + A(I, J-1) + A(I, J+1)) \cdot 0.5D0 + (A(I-1, J-1) + A(I-1, J+1) + A(I+1, J-1) + A(I+1, J+1)) \cdot 0.25D0$$

are difficult to express efficiently via slicing without redundant memory copies. C-extensions allow for direct, cache-aligned pointer arithmetic.

main.py
TERMINAL bash — 80x24
> Ready. Click "Run" to execute.
>